72 research outputs found

    Approximately Minwise Independence with Twisted Tabulation

    Full text link
    A random hash function hh is ε\varepsilon-minwise if for any set SS, S=n|S|=n, and element xSx\in S, Pr[h(x)=minh(S)]=(1±ε)/n\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n. Minwise hash functions with low bias ε\varepsilon have widespread applications within similarity estimation. Hashing from a universe [u][u], the twisted tabulation hashing of P\v{a}tra\c{s}cu and Thorup [SODA'13] makes c=O(1)c=O(1) lookups in tables of size u1/cu^{1/c}. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields O~(1/u1/c)\tilde O(1/u^{1/c})-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79] O~(1/u1/c)\tilde O(1/u^{1/c})-minwise hashing requires Ω(logu)\Omega(\log u)-independence [Indyk SODA'99]. P\v{a}tra\c{s}cu and Thorup [STOC'11] had shown that simple tabulation, using same space and lookups yields O~(1/n1/c)\tilde O(1/n^{1/c})-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner bypassing a complicated induction argument.Comment: To appear in Proceedings of SWAT 201

    On the k-Independence Required by Linear Probing and Minwise Independence

    Full text link

    Comparison of Failures and Attacks on Random and Scale-Free Networks

    No full text
    It appeared recently that some statistical properties of complex networks like the Internet, the World Wide Web or Peer-to-Peer systems have an important influence on their resilience to failures and attacks. In particular, scale-free networks (i.e. networks with power-law degree distribution) seem much more robust than random networks in case of failures, while they are more sensitive to attacks. In this paper we deepen the study of the differences in the behavior of these two kinds of networks when facing failures or attacks. We moderate the general affirmation that scale-free networks are much more sensitive than random networks to attacks by showing that the number of links to remove in both cases is similar, and by showing that a slightly modified scenario for failures gives results similar to the ones for attacks. We also propose and analyze an efficient attack strategy against links

    Dictionary matching in a stream

    Get PDF
    We consider the problem of dictionary matching in a stream. Given a set of strings, known as a dictionary, and a stream of characters arriving one at a time, the task is to report each time some string in our dictionary occurs in the stream. We present a randomised algorithm which takes O(log log(k + m)) time per arriving character and uses O(k log m) words of space, where k is the number of strings in the dictionary and m is the length of the longest string in the dictionary

    Counting approximately-shortest paths in directed acyclic graphs

    Full text link
    Given a directed acyclic graph with positive edge-weights, two vertices s and t, and a threshold-weight L, we present a fully-polynomial time approximation-scheme for the problem of counting the s-t paths of length at most L. We extend the algorithm for the case of two (or more) instances of the same problem. That is, given two graphs that have the same vertices and edges and differ only in edge-weights, and given two threshold-weights L_1 and L_2, we show how to approximately count the s-t paths that have length at most L_1 in the first graph and length at most L_2 in the second graph. We believe that our algorithms should find application in counting approximate solutions of related optimization problems, where finding an (optimum) solution can be reduced to the computation of a shortest path in a purpose-built auxiliary graph

    Scalable Mining of Common Routes in Mobile Communication Network Traffic Data

    Get PDF
    A probabilistic method for inferring common routes from mobile communication network traffic data is presented. Besides providing mobility information, valuable in a multitude of application areas, the method has the dual purpose of enabling efficient coarse-graining as well as anonymisation by mapping individual sequences onto common routes. The approach is to represent spatial trajectories by Cell ID sequences that are grouped into routes using locality-sensitive hashing and graph clustering. The method is demonstrated to be scalable, and to accurately group sequences using an evaluation set of GPS tagged data

    Summary cache: a scalable wide-area Web cache sharing protocol

    Full text link

    Cross-language high similarity search using a conceptual thesaurus

    Full text link
    This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and it has been partially funded by the European Commission as part of the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework, and by the Text-Enterprise 2.0 research project (TIN2009-13391-C04-03). The research work of the second author is supported by the CONACyT 192021/302009 grantGupta, P.; Barrón Cedeño, LA.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. En Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8S6775748

    Analysis of Agglomerative Clustering

    Full text link
    The diameter kk-clustering problem is the problem of partitioning a finite subset of Rd\mathbb{R}^d into kk subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem (for all values of kk) is the agglomerative clustering algorithm with the complete linkage strategy. For decades, this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper, we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension dd is a constant, we show that for any kk the solution computed by this algorithm is an O(logk)O(\log k)-approximation to the diameter kk-clustering problem. Our analysis does not only hold for the Euclidean distance but for any metric that is based on a norm. Furthermore, we analyze the closely related kk-center and discrete kk-center problem. For the corresponding agglomerative algorithms, we deduce an approximation factor of O(logk)O(\log k) as well.Comment: A preliminary version of this article appeared in Proceedings of the 28th International Symposium on Theoretical Aspects of Computer Science (STACS '11), March 2011, pp. 308-319. This article also appeared in Algorithmica. The final publication is available at http://link.springer.com/article/10.1007/s00453-012-9717-
    corecore